Identifying and addressing offensive actions in visual communication sessions

ABSTRACT

A processing system having at least one processor may establish a communication session between a first communication system of a first user and a second communication system of a second user, the communication session including first visual content, the first visual content including a first visual representation of the first user, and may detect a first action of the first visual representation in the first visual content in accordance with a first action detection model. The processing system may modify, in response to the detecting the first action, the first visual content in accordance with a first configuration setting of the first user for the communication session, which may include modifying the first action of the first visual representation of the first user in the first visual content. In addition, the processing system may transmit the first visual content that is modified to the second communication system of the second user.

This application is a continuation of U.S. patent application Ser. No. 17/176,119, filed Feb. 15, 2021, now U.S. Pat. No. 11,341,775, which is a continuation of U.S. patent application Ser. No. 16/171,944, filed on Oct. 26, 2018, now U.S. Pat. No. 10,922,534, both of which are herein incorporated by reference in their entirety.

The present disclosure relates generally to visual communication sessions, and more particularly to methods, computer-readable media, and devices for detecting and modifying actions of visual representations of users in visual content.

BRIEF DESCRIPTION OF THE DRAWINGS

The teachings of the present disclosure can be readily understood by considering the following detailed description in conjunction with the accompanying drawings, in which:

FIG. 1 illustrates an example network related to the present disclosure;

FIG. 2 illustrates a flowchart of an example method for detecting and modifying actions of visual representations of users in visual content; and

FIG. 3 illustrates a high level block diagram of a computing device specifically programmed to perform the steps, functions, blocks and/or operations described herein.

To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures.

DETAILED DESCRIPTION

In one example, the present disclosure describes a method, computer-readable medium, and device for detecting and modifying actions of visual representations of users in visual content. For instance, in one example, a method may include a processing system having at least one processor establishing a communication session between at least a first communication system of a first user and a second communication system of a second user, the communication session including first visual content, the first visual content including a first visual representation of the first user, and detecting a first action of the first visual representation of the first user in the first visual content in accordance with a first action detection model for detecting the first action. The processing system may then modify, in response to the detecting the first action, the first visual content in accordance with a first configuration setting of the first user for the communication session, which may include modifying the first action of the first visual representation of the first user in the first visual content based upon the first configuration setting. In addition, the processing system may transmit the first visual content that is modified to the second communication system of the second user.

Mixed reality (MR), augmented reality (AR), virtual reality (VR), or video-based communication sessions, such as calls, video game environments, group hangouts, and the like may include participants who are inclined to disregard social norms that would typically be employed in everyday personal interactions. A service providing the infrastructure supporting the communication session may allow participants to indicate to the system that another's activity is objectionable. However, there may be a negative stigma associated with reporting offensive actions that deters some participants from reporting even clearly objectionable behavior. This lack of reporting may also allow the objectionable behavior to continue and affect additional users. The service may further rely upon human reviewers to investigate complaints, review recorded visual information and other context information, and make determinations as to whether to warn, block, or otherwise address offending participants. Determining what is offensive is a subjective determination that is difficult to enforce consistently and may subject the service provider to criticism from the complaining party or the accused party, either or both of whom may be dissatisfied with the service provider's solution to handling the complaint.

Examples of the present disclosure include a processing system that supports visual communication sessions and that detects and addresses actions that are deemed offensive within the personal opinions of particular participants. In one example, the processing system passively and continuously observes participants' activities on the platform to determine contexts and particular actions. In one example, the processing system may maintain action detection models for detecting respective objectionable actions but may not require that these actions be specifically labeled as particular actions (e.g., offensive gesture, offensive end-zone dancing, etc.). For example, the type of action detection model may remain unlabeled, yet future instances of the action may be blocked/manipulated in the visual content. In one example, the processing system engages in continuous learning of anomalous and potentially objectionable actions from the visual content, without initially marking events as objectionable.

The types of features from which action detection models may be derived may include visual features from visual content segments. For instance, segments of visual content having “unusual” features may be determined via a comparison of features from one or more frames in a given time window versus “normal” or average features from a larger time period. The features may include low-level invariant image data, such as colors (e.g., RGB (red-green-blue) or CYM (cyan-yellow-magenta) raw data (luminance values) from a CCD/photo-sensor array), shapes, color moments, color histograms, edge distribution histograms, etc. Visual features may also relate to movement in a video and may include changes within images and between images in a sequence (e.g., video frames or a sequence of still image shots), such as color histogram differences or a change in color distribution, edge change ratios, standard deviation of pixel intensities, contrast, average brightness, and the like. In one example, the system may perform image salience detection processes, e.g., applying an image salience model and then performing an image recognition algorithm over the “salient” portion of the image(s). Thus, in one example, visual features may also include a recognized object (e.g., including parts of a human body such as legs, arms, hands, etc.), a length to width ratio of an object, a velocity of an object estimated from a sequence of images (e.g., video frames), and so forth. Features may additionally be taken from wearable device inputs such as gyroscope and compass measurements from various points of a human body, eye movements, and so forth.
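As a purely illustrative sketch of the comparison described above, the following Python fragment computes a color histogram difference between consecutive frames and flags frames whose change is far from the average over the whole clip. The function names, the flat list-of-intensities frame format, the bin count, and the z-score threshold are all assumptions made for this illustration and are not taken from the disclosure.

```python
from statistics import mean, pstdev

def color_histogram(frame, bins=4):
    """Normalized histogram of 0-255 intensity values in a flat frame."""
    counts = [0] * bins
    for value in frame:
        counts[min(value * bins // 256, bins - 1)] += 1
    return [c / len(frame) for c in counts]

def histogram_difference(h1, h2):
    """Sum of absolute per-bin differences between two histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def unusual_frames(frames, z_threshold=2.0):
    """Indices of frames whose histogram change from the previous frame
    is more than z_threshold standard deviations above the clip average."""
    hists = [color_histogram(f) for f in frames]
    diffs = [histogram_difference(hists[i - 1], hists[i])
             for i in range(1, len(hists))]
    mu, sigma = mean(diffs), pstdev(diffs)
    if sigma == 0:
        return []  # every transition looks the same; nothing stands out
    return [i + 1 for i, d in enumerate(diffs) if (d - mu) / sigma > z_threshold]
```

Here the "larger time period" is simply the whole clip; a deployed system might instead keep a running average per participant or per context.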

In one example, an action detection model, or “signature” may be created that represents a particular action. The action detection model may comprise a machine learning algorithm (MLA), or machine learning model (MLM) trained via the MLA and which may comprise, for example, a deep learning neural network, or deep neural network (DNN), a generative adversarial network (GAN), a support vector machine (SVM), e.g., a binary, non-binary, or multi-class classifier, a linear or non-linear classifier, and so forth. In one example, the MLA may incorporate an exponential smoothing algorithm (such as double exponential smoothing, triple exponential smoothing, e.g., Holt-Winters smoothing, and so forth), reinforcement learning (e.g., using positive and negative examples after deployment as a MLM), and so forth. It should be noted that various other types of MLAs and/or MLMs may be implemented in examples of the present disclosure, such as k-means clustering and/or k-nearest neighbor (KNN) predictive models, support vector machine (SVM)-based classifiers, e.g., a binary classifier and/or a linear binary classifier, a multi-class classifier, a kernel-based SVM, etc., a distance-based classifier, e.g., a Euclidean distance-based classifier, or the like, and so on. In one example, the signature may include those features which are determined to be the most distinguishing features of the action, e.g., those features which are quantitatively the most different from what is considered statistically normal or average from visual content associated with a given participant, a group of participants, a given context, and/or in general, e.g., the top 20 features, the top 50 features, etc.
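One way to picture the selection of the most distinguishing features is to rank each feature by how far its value lies from a baseline, measured in standard deviations. The sketch below assumes features arrive as name-to-value dictionaries; the function name, the z-score ranking, and the feature names in the usage note are hypothetical choices for illustration only.

```python
from statistics import mean, pstdev

def build_signature(action_features, baseline_samples, top_k=2):
    """Keep the top_k features whose values deviate most from the baseline.

    action_features: {feature_name: value} observed during the action.
    baseline_samples: list of {feature_name: value} dicts taken from
    ordinary content (a participant, a group, a context, or in general).
    """
    scores = {}
    for name, value in action_features.items():
        history = [sample[name] for sample in baseline_samples]
        sigma = pstdev(history) or 1e-9  # guard against a constant baseline
        scores[name] = abs(value - mean(history)) / sigma
    ranked = sorted(scores, key=scores.get, reverse=True)
    return {name: action_features[name] for name in ranked[:top_k]}
```

For instance, with a baseline in which `hand_velocity` is normally near 1.5, an action sample with `hand_velocity` of 6.5 would rank that feature ahead of one that stayed close to its usual value.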

In one example, an action detection model, or “signature” may be created that represents multiple detected actions having a threshold similarity. In other words, the multiple detected actions are considered to be unique occurrences of a same action, or a same type of action. For instance, the action signature may comprise a machine learning model (MLM) that is trained based upon the plurality of features from a plurality of the same and/or similar events. For example, each of the similar events may comprise a set of features used as a positive example that is applied to a machine learning algorithm (MLA) to generate the action signature (e.g., a MLM). In one example, the positive examples used to train the MLM may be determined to be “similar” in accordance with an unsupervised, supervised, and/or semi-supervised clustering algorithm. In one example, the event detection model may be represented as an MLM comprising the average features of a cluster of the plurality of similar events in a feature space, a cluster centroid, or the like.
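The cluster-centroid representation can be sketched as follows. The greedy, distance-threshold grouping used here is only a stand-in assumed for illustration, since the text does not specify a particular clustering algorithm; the function names and the Euclidean metric are likewise assumptions.

```python
import math

def centroid(vectors):
    """Component-wise average of equal-length feature vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cluster_events(vectors, max_distance):
    """Greedy grouping: an event joins the first cluster whose current
    centroid lies within max_distance, otherwise it starts a new cluster."""
    clusters = []
    for vec in vectors:
        for members in clusters:
            if math.dist(centroid(members), vec) <= max_distance:
                members.append(vec)
                break
        else:
            clusters.append([vec])
    return clusters
```

Each resulting cluster's centroid could then serve as a simple model of one action type.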

In one example, if an action becomes frequently observed and results in negative experiences for one or more users, the action can be identified as a negative action. To illustrate, a three finger gesture may have negative meaning in certain cultures, but not in other cultures. The processing system may detect occurrences of this action identified from similar patterns in segments of the visual content, cluster these occurrences and the features thereof, and create an action detection model comprising these features. The processing system may also receive inputs from users associated with these visual content segments and learn that this type of action has a negative effect on such users. Namely, some users may find such a gesture offensive while others may not.

Moving forward, the processing system may then detect occurrences of the action in visual content, and block or otherwise address the occurrences in accordance with the preferences of one or more users. For instance, in one example, the action detection model (e.g., a MLM) may be applied to process outbound and/or inbound visual content and to identify patterns in the features of the visual content that match the action detection model/signature. In one example, a match may be determined using any of the visual features and/or other features mentioned above. For instance, a match may be determined when there is a threshold measure of similarity among the features of the visual content and the action detection model. In one example, the threshold measure of similarity may alternatively or additionally include matching additional features associated with measurements from wearable devices and/or other sensors. In one example, the features from the visual content and/or additional features may be analyzed using a time-based sliding window. Thus, the next time there is a similar sequence of events, e.g., similar imagery and/or movements as recorded by wearable devices and/or other sensors, it may be associated with the action type and may be identified as a potential additional occurrence of the same action.
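The threshold-similarity match over a sliding window might look like the following minimal sketch. Cosine similarity, the window averaging, and the default threshold are assumptions chosen for illustration; the text leaves the similarity measure open.

```python
import math
from collections import deque

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    norm = math.hypot(*a) * math.hypot(*b)
    if norm == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / norm

def detect_matches(feature_stream, signature, window=3, threshold=0.95):
    """Indices at which the average of the last `window` feature vectors
    is at least `threshold`-similar to the signature."""
    recent = deque(maxlen=window)  # the sliding window of latest vectors
    hits = []
    for i, vec in enumerate(feature_stream):
        recent.append(vec)
        if len(recent) < window:
            continue  # not enough history yet
        average = [sum(column) / window for column in zip(*recent)]
        if cosine_similarity(average, signature) >= threshold:
            hits.append(i)
    return hits
```

Averaging over the window means a single stray frame does not trigger a match; the action must persist for roughly the window length.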

When an additional occurrence of the action is detected, all or a portion of a visual representation of a participant performing the offensive action may be blocked, a portion of the visual representation of the participant performing the offensive action may be modified (e.g., blurred, replaced, or substituted), and so on. In one example, objectionable actions may be addressed in both outbound and inbound directions at a user's communication system or in a network-based processing system. For instance, the objectionable action may be addressed at the offending user's communication system (e.g., by blocking, replacing, obfuscating, etc.) and/or with the same or similar remedial measures at the recipient's communication system. In one example, outbound filtering at the offending participant's communication system may be in accordance with the offending participant's own set of configuration settings identifying actions that are considered objectionable by the participant. For example, the participant may utilize the visual communication session for work or professional purposes and may wish to self-censor certain actions that the participant may inadvertently perform, but which the participant would prefer that others not see.

The types of remediation may be selected by default or may be user-specified. For instance, a participant may flag an action as offensive and may provide additional input that subsequent occurrences of the action should be blocked from visual content (e.g., inbound and/or outbound). However, if the participant flags the action as objectionable but does not specify how to address future occurrences of the action, the processing system may implement a default response such as blurring out the pertinent action in the visual content.
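The fallback from a user-specified remediation to a default can be sketched as a simple lookup. In this illustration the frame is a small grayscale grid, "blur" is crudely approximated by flattening the region to its mean intensity, and the settings dictionary, action identifiers, and function names are all hypothetical.

```python
DEFAULT_REMEDIATION = "blur"  # assumed default, following the blurring example

def block(pixels):
    """Black out every pixel in the region."""
    return [0] * len(pixels)

def blur(pixels):
    """Crude stand-in for blurring: flatten the region to its mean intensity."""
    return [sum(pixels) // len(pixels)] * len(pixels)

REMEDIATIONS = {"block": block, "blur": blur}

def remediate(frame, box, action_id, user_settings):
    """Return a copy of a grayscale frame with the boxed region remediated.

    box is (top, left, height, width). user_settings maps a flagged action
    id to a remediation name and may omit actions the user never configured.
    """
    top, left, height, width = box
    mode = user_settings.get(action_id, DEFAULT_REMEDIATION)
    cells = [(r, c) for r in range(top, top + height)
             for c in range(left, left + width)]
    new_values = REMEDIATIONS[mode]([frame[r][c] for r, c in cells])
    out = [row[:] for row in frame]  # leave the original frame untouched
    for (r, c), value in zip(cells, new_values):
        out[r][c] = value
    return out
```

A participant who flagged an action without choosing a response would have no entry in `user_settings`, so the default applies.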

Examples of the present disclosure improve social interactions by automated and preemptive filtering of offensive actions instead of manual annotation and faulty auditing. Examples of the present disclosure also prevent spoofing/anonymization to avoid detection. Although examples of the present disclosure primarily provide automated detection and remediation of offensive actions, in one example, the present disclosure may further include a dashboard for a moderator of a multiplayer video game (e.g., a VR game) or other visual communication services, where the dashboard provides a view of offensive actions along with manually selectable options to moderate accordingly for the various experiences and products under the moderator's purview.

In one example, a participant may be enrolled for objectionable action filtering and a profile created for the participant. The profile can initially be bootstrapped from an existing sample profile (e.g., an employer-provided profile, an age-based or other demographic-based profile, etc.) or entered/customized by the participant during setup. In one example, the processing system may collect interaction data pertaining to interactions of the participant (visual communication sessions, traditional voice calls, text/Short Message Service (SMS) messages, emails, and so forth). The processing system may then populate the participant's profile with actions that are typical/atypical, offensive/non-offensive, etc., with respect to the participant's social circle. Similarly, in one example, the processing system may assign a participant to a category based upon the participant's other network usage, such as online purchases made or shopping items viewed, websites visited, and so forth, and may then assign a profile to the participant based upon the participant's categorization. Alternatively, or in addition, the processing system may continue to monitor network usage data for the participant and update the categorization and associated profile for the participant if and when such categorization changes. For instance, the participant may have a change in habits which may correspond to a generally more restrictive or permissive level of offense that can be adapted to by the processing system. In one example, the processing system may not monitor the participant's network usage, but may periodically subscribe to a service to receive categorization updates for the participant and assign a profile matching the current categorization. In one example, the participant can provide additional examples of accept/reject criteria for the processing system to learn (e.g., prior to the participant actually engaging in visual communication sessions supported by the processing system).

In one example, the processing system learns new actions for detection and classification through participant-labeled examples. For example, an offensive action (e.g., one that is actually offensive to a participant, one that is questionable and which the participant believes may be offensive to others, etc.) may be signaled as such by an input from the participant. For instance, the participant may provide an input to the processing system via a keyboard or mouse, via a voice command, using a gesture captured via a wearable computing device, and so on. In one example, a participant may signal an action to be positive, negative, or neutral.

For a negative/objectionable flagged action, the processing system may label the action, and create and activate an action detection model (e.g., a MLM) for detecting subsequent occurrences of the action. In one example, learning of an action detection model for a new action can be specific to a particular participant, can include multiple participants of a given social group or other segments of participants, or can be platform-wide. For instance, multiple user labels for the same and/or similar actions may be pooled, creating a larger aggregate (and more diverse) event detection model that may reduce false alarms from a single-person input. Alternatively, or in addition, learning of an event detection model for a new action can include developing a “lite” version on a local client (e.g., a given participant's communication system) and then comprehensive tuning of the event detection model may be performed with respect to inputs from a plurality of participants regarding various segments of visual content. In addition, in one example, participants' labels for actions can be weighted based upon the participants' respective experience with the visual communication service (e.g., number of years as a participant, number of visual communication sessions, time spent on the platform, etc.), based upon the participants' respective reputation scores, and so forth. In one example, event detection models (whether specific to a participant, or associated with and/or used by a group of participants) may be updated to account for new data and may be redeployed as updated versions.
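To make the weighted pooling of labels concrete, the sketch below combines participant votes into a single score. The encoding of labels as +1 (objectionable), 0 (neutral), and -1 (not objectionable), the way a weight would be derived from experience or reputation, and the decision threshold are all assumptions for illustration.

```python
def pooled_score(votes):
    """Weighted average of labels.

    votes: list of (label, weight) pairs, where label is +1, 0, or -1 and
    weight reflects the labeler's experience and/or reputation score.
    """
    total_weight = sum(weight for _, weight in votes)
    if total_weight == 0:
        return 0.0  # no usable input
    return sum(label * weight for label, weight in votes) / total_weight

def is_objectionable(votes, threshold=0.5):
    """Treat the action as objectionable only if the pooled score is high
    enough, so that one low-weight label cannot trigger filtering alone."""
    return pooled_score(votes) >= threshold
```

For example, `is_objectionable([(1, 3.0), (1, 1.0), (-1, 1.0)])` pools to a score of 0.6 and would pass the assumed threshold.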

In one example, a moderator may be contacted by the participant or automatically notified by the processing system to review recent borderline activities and to apply human judgement for labeling. For instance, the processing system may identify trending (or instantaneous) anomalous actions that are labeled as negative or questionable for view by the moderator via a user interface. Similarly, in one example, the processing system may display new trends of actions for participants to discover new actions that may be offensive, new actions which the participants may be interested to learn to keep up with cutting-edge socio-cultural progression, and so on.

In one example, the processing system may send a notification to an offending participant when his or her action is detected as an offensive action by the processing system (and/or when remediated). In one example, instead of or in parallel to blocking or otherwise addressing a detected offensive action, the processing system may provide guidance for a participant regarding appropriate and inappropriate actions with respect to a current context. For example, the processing system may recommend that a participant perform an example positive action that may be calculated to be warranted in a currently detected context. For instance, the participant may be engaged in a visual communication session with others who have particularly indicated that a given action is considered to be a “positive” action. In one example, a participant may include a bot or automated agent acting on behalf of a person or organization. As such, in one example, feedback from the processing system as to positive, negative, or neutral actions may be used to train the bot/agent in accordance with one or more machine learning model(s) defining the agent/bot.

In one example, the processing system may alternatively or additionally notify a receiver (or sender) of a potential remediation and ask for consent/authorization to override or to select a non-default remediation option for the action, e.g., altering the visual representation of the action to appear differently, rather than simply blocking the action from the video content. In another example, the processing system may recommend interactions between participants based on similar profiles, similar flagging of actions as offensive, and so forth. In still another example, if action detection models are running locally on a participant's communication system, these models may be transferred to another device or system comprising a plurality of devices. In one example, preemptive remediation is expedited by manual flagging by a participant. However, in one example, the processing system may also learn negative actions from observing a participant's behavior, reaction, and/or mood after an action (either as the performer of the action, or as a recipient of visual content from another that includes the actions). Thus, consistent patterns that adversely affect participants and their experiences are also gradually detected and may be filtered, even without explicit participant feedback.

In one example, the processing system may maintain scores (likelihood) for actions for various participants (e.g., participant 1 is “likely” to perform action X, participant 2 is “highly unlikely” to perform action X, and so forth). In one example, the processing system may selectively maintain different event detection filters as active. For instance, it may be overwhelming to the processing system to simultaneously maintain active action detection models for various actions for one or more participants. However, certain actions may be deemed more or less likely to be detected based upon the identities of the participants and their respective scores with regard to various actions. Thus, those actions which are deemed offensive by one or more participants and which are more likely to occur (based upon the scores of one or more participants) may be selected to be active. As just one example, a participant may be particularly prone to engaging in a given offensive action and the participant may have included this action for outbound filtering (self-censoring). The processing system may learn the participant's proclivity for this particular action based upon detecting the same action being performed by the participant in other visual communication sessions. Thus, the processing system may ensure that the action detection model for this action is active because the user is more likely to engage in this particular offensive action as compared to other offensive actions that the participant wants to be filtered, but that the participant is less likely to engage in. These and other aspects of the present disclosure are described in greater detail below in connection with the examples of FIGS. 1-3.
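The selection of which detection models to keep active can be illustrated with a small ranking routine. Representing the scores as per-participant probabilities, ranking each flagged action by its highest score among the session's participants, and capping the result at a fixed budget are assumptions made for this sketch.

```python
def select_active_models(flagged_actions, likelihoods, budget):
    """Choose up to `budget` flagged actions to monitor in a session.

    flagged_actions: set of action ids that at least one participant
    wants filtered.
    likelihoods: {participant: {action_id: likelihood in [0, 1]}} for the
    participants in the session.
    """
    def best_score(action):
        # An action is as likely as its most likely performer makes it.
        return max((scores.get(action, 0.0) for scores in likelihoods.values()),
                   default=0.0)

    # Most likely first; ties broken by action id for a stable ordering.
    ranked = sorted(flagged_actions, key=lambda a: (-best_score(a), a))
    return ranked[:budget]
```

Actions that nobody flagged are never activated, however likely they are, because the candidates are drawn only from `flagged_actions`.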

To further aid in understanding the present disclosure, FIG. 1 illustrates an example system 100 in which examples of the present disclosure for detecting and modifying actions of visual representations of users in visual content may operate. The system 100 may include any one or more types of communication networks, such as a traditional circuit switched network (e.g., a public switched telephone network (PSTN)) or a packet network such as an Internet Protocol (IP) network (e.g., an IP Multimedia Subsystem (IMS) network), an asynchronous transfer mode (ATM) network, a wireless network, a cellular network (e.g., in accordance with 3G, 4G/long term evolution (LTE), 5G, etc.), and the like related to the current disclosure. It should be noted that an IP network is broadly defined as a network that uses Internet Protocol to exchange data packets. Additional example IP networks include Voice over IP (VoIP) networks, Service over IP (SoIP) networks, and the like.

In one example, the system 100 may comprise a network 102, e.g., a telecommunication service provider network, a core network, an enterprise network comprising infrastructure for computing and communications services of a business, an educational institution, a governmental service, or other enterprises. The network 102 may be in communication with one or more access networks 120 and 122, and the Internet (not shown). In one example, network 102 may combine core network components of a cellular network with components of a triple play service network; where triple-play services include telephone services, Internet services and television services to subscribers. For example, network 102 may functionally comprise a fixed mobile convergence (FMC) network, e.g., an IP Multimedia Subsystem (IMS) network. In addition, network 102 may functionally comprise a telephony network, e.g., an Internet Protocol/Multi-Protocol Label Switching (IP/MPLS) backbone network utilizing Session Initiation Protocol (SIP) for circuit-switched and Voice over Internet Protocol (VoIP) telephony services. Network 102 may further comprise a broadcast television network, e.g., a traditional cable provider network or an Internet Protocol Television (IPTV) network, as well as an Internet Service Provider (ISP) network. In one example, network 102 may include a plurality of television (TV) servers (e.g., a broadcast server, a cable head-end), a plurality of content servers, an advertising server (AS), an interactive TV/video on demand (VoD) server, and so forth.

In accordance with the present disclosure, application server (AS) 104 may comprise a computing system or server, such as computing system 300 depicted in FIG. 3, and may be configured to provide one or more operations or functions for detecting and modifying actions of visual representations of users in visual content, as described herein. It should be noted that as used herein, the terms “configure,” and “reconfigure” may refer to programming or loading a processing system with computer-readable/computer-executable instructions, code, and/or programs, e.g., in a distributed or non-distributed memory, which when executed by a processor, or processors, of the processing system within a same device or within distributed devices, may cause the processing system to perform various functions. Such terms may also encompass providing variables, data values, tables, objects, or other data structures or the like which may cause a processing system executing computer-readable instructions, code, and/or programs to function differently depending upon the values of the variables or other data structures that are provided. As referred to herein a “processing system” may comprise a computing device including one or more processors, or cores (e.g., as illustrated in FIG. 3 and discussed below) or multiple computing devices collectively configured to perform various steps, functions, and/or operations in accordance with the present disclosure.

Thus, although only a single application server (AS) 104 is illustrated, it should be noted that any number of servers may be deployed, and which may operate in a distributed and/or coordinated manner as a processing system to perform operations for detecting and modifying actions of visual representations of users in visual content, in accordance with the present disclosure. In one example, AS 104 may comprise a physical storage device (e.g., a database server), to store various types of information in support of systems for detecting and modifying actions of visual representations of users in visual content, in accordance with the present disclosure. For example, AS 104 may store one or more configuration settings for various users, households, employers, service providers, and so forth that may be processed by AS 104 in connection with establishing visual communication sessions, or that may be provided to devices establishing visual communication sessions via AS 104. AS 104 may further create and/or store action detection models which may be utilized by users, households, employers, service providers, and so forth in connection with such configuration settings. For ease of illustration, various additional elements of network 102 are omitted from FIG. 1.

In one example, the access networks 120 and 122 may comprise Digital Subscriber Line (DSL) networks, public switched telephone network (PSTN) access networks, broadband cable access networks, Local Area Networks (LANs), wireless access networks (e.g., an IEEE 802.11/Wi-Fi network and the like), cellular access networks, 3rd party networks, and the like. For example, the operator of network 102 may provide a cable television service, an IPTV service, or any other types of telecommunication service to subscribers via access networks 120 and 122. In one example, the access networks 120 and 122 may comprise different types of access networks, may comprise the same type of access network, or some access networks may be the same type of access network and others may be different types of access networks. In one example, the network 102 may be operated by a telecommunication network service provider. The network 102 and the access networks 120 and 122 may be operated by different service providers, the same service provider or a combination thereof, or may be operated by entities having core businesses that are not related to telecommunications services, e.g., corporate, governmental or educational institution LANs, and the like.

In one example, the access network 120 may be in communication with a device 131. Similarly, access network 122 may be in communication with one or more devices, e.g., device 141. Access networks 120 and 122 may transmit and receive communications between devices 131 and 141, and between devices 131 and 141 and application server (AS) 104, other components of network 102, devices reachable via the Internet in general, and so forth. In one example, each of devices 131 and 141 may comprise any single device or combination of devices that may comprise a user endpoint device. For example, the devices 131 and 141 may each comprise a mobile device, a cellular smart phone, a wearable computing device (e.g., smart glasses), a laptop, a tablet computer, a desktop computer, an application server, a bank or cluster of such devices, and the like. In one example, devices 131 and 141 may each comprise programs, logic or instructions for performing functions in connection with examples of the present disclosure for detecting and modifying actions of visual representations of users in visual content. For example, devices 131 and 141 may each comprise a computing system or device, such as computing system 300 depicted in FIG. 3, and may be configured to provide one or more operations or functions in connection with examples of the present disclosure for detecting and modifying actions of visual representations of users in visual content, as described herein.

In one example, the device 131 is associated with a first user (user 1) 191 at a first physical environment 130. As illustrated in FIG. 1, the device 131 may comprise a wearable computing device (e.g., smart glasses) and may provide a user interface 135 for user 191. For instance, device 131 may comprise smart glasses with augmented reality (AR) enhancement capabilities. For example, endpoint device 131 may have a screen and a reflector to project outlining, highlighting, or other visual markers to the eye(s) of user 191 to be perceived in conjunction with the surroundings. In the present example, device 131 may provide three windows 137-139 in the user interface 135. Also associated with user 191 and/or first physical environment 130 is a camera 132 which may be facing user 191 and which may capture a video comprising the first physical environment 130, including user 191 and other items or objects therein, such as sticks A and B. In one example, camera 132 may communicate with device 131 wirelessly, e.g., to provide a video stream of the first physical environment 130. As an alternative, or in addition, in one example, device 131 may also comprise an outward facing camera to capture video of the first physical environment 130 from a field of view in a direction that user 191 is looking.

In one example, the device 131 may present visual content of one or more other users via user interface 135 (e.g., presented as a plurality of windows 137-139 in FIG. 1). In one example, the physical environment 130 and user interface 135 may comprise an augmented reality (AR) or a mixed reality (MR) environment, e.g., when the physical environment 130 remains visible to user 191 when using device 131, and the visual content received from one or more other users is presented spatially in an intelligent manner with respect to the physical environment 130. In another example, the user interface 135 may comprise a virtual reality (VR) environment for the user 191. In one example, the components associated with user 191 and/or first physical environment 130 that are used to establish and support a visual communication session may be referred to as a “communication system.” For instance, a communication system may comprise device 131, or device 131 in conjunction with camera 132, device 131 in conjunction with a smartphone or personal computer, a wireless router, or the like supporting visual communication sessions of device 131, and so on.

Similarly, device 141 may be associated with a second user (user 2) 192 and a third user (user 3) 193 at a second physical environment 140. As illustrated in FIG. 1, the device 141 may comprise a personal computer, desktop computer, or the like, and may provide a user interface 145 for users 192 and 193 via a plurality of display screens 147-149. The user interface 145 may be similar to user interface 135, but may be provided with physical display screens 147-149 instead of projections of windows 137-139. Also associated with users 192 and 193, and/or second physical environment 140, is a camera 142 which may be facing users 192 and 193 and which may capture a video comprising the second physical environment 140, including users 192 and 193 and other items or objects therein. In one example, camera 142 may be coupled to device 141 and may provide a video stream of the second physical environment 140. As illustrated in FIG. 1, user 193 may also have wearable devices/sensors 143 and 144 which may measure, record, and/or transmit data related to movement and position, such as locations, orientations, accelerations, and so forth. For instance, wearable devices/sensors 143 and 144 may each include a Global Positioning System (GPS) unit, a gyroscope, a compass, one or more accelerometers, and so forth. In one example, wearable devices/sensors 143 and 144 may also measure, record, and/or transmit biometric data, such as a heart rate, a skin conductance, and so on. In one example, wearable devices/sensors 143 and 144 may include transceivers for wireless communications, e.g., for Institute of Electrical and Electronics Engineers (IEEE) 802.11 based communications (e.g., “Wi-Fi”), IEEE 802.15 based communications (e.g., “Bluetooth”, “ZigBee”, etc.), cellular communication (e.g., 3G, 4G/LTE, 5G, etc.), and so forth. As such, wearable devices/sensors 143 and 144 may provide various measurements to device 141 and/or to AS 104 (e.g., via device 141 and/or via access network 122).

In one example, devices 131 and 141 may communicate with each other and/or with AS 104 to establish, maintain/operate, and/or tear-down a visual communication session. In one example, AS 104 and device 131 and/or device 141 may operate in a distributed and/or coordinated manner to perform various steps, functions, and/or operations described herein. To illustrate, AS 104 may establish and maintain visual communication sessions for various users and may store and implement one or more configuration settings specifying both inbound and outbound modifications of visual content from the various users. The visual content may comprise video content, which may include visual imagery of a physical environment (e.g., including imagery of one or more users), and which in some cases may further include recorded audio of the physical environment. In one example, the visual content may also include virtual reality (VR) and/or augmented reality (AR) (also referred to as mixed reality (MR)) visual content, such as images of artificial scenery, background, or objects, avatars representing various users, and so forth. For instance, AS 104 may maintain a virtual world for a massive multi-player online game (MMOG, e.g., a type of “visual communication session”), or the like.

As used herein, the term AR environment, or virtual environment, refers to a set of images or sounds that are generated by devices and systems of the present disclosure and that are presented to users, e.g., exclusively via an immersive headset and/or earphone or as a supplement to images and sounds that are generated outside of the devices and systems of the present disclosure, i.e., in the “real-world.” Thus, the terms augmented reality (AR) environment and virtual environment may be used herein to refer to the entire environment experienced by a user, including real-world images and sounds combined with images and sounds of the AR environment/virtual environment. The images and sounds of an AR environment may be referred to as “virtual objects” and may be presented to users via devices and systems of the present disclosure. While the real world may include other machine generated images and sounds, e.g., animated billboards, music played over loudspeakers, and so forth, these images and sounds are considered part of the “real-world,” in addition to natural sounds and sights such as waves crashing on a beach, the sound of wind through the trees and the corresponding image of waving tree branches, the sights and sounds of wildlife, and so on.

With respect to an avatar representing a user, the avatar may be controlled by the user and move within a virtual environment using any number of forms of input, such as voice commands, a keyboard, a mouse, a joystick, or the like. Alternatively, or in addition, the avatar may be controlled via one or more wearable devices of the user. For instance, the avatar may be made to move within the virtual environment in accordance with movements of the user's body as detected via the one or more wearable devices. It should be noted that the presentation of the avatar of the user for other users participating in the visual communication session may have a fixed relationship to the physical world, e.g., a 1:1 ratio of movement/position, may be scaled, e.g., a 4:1 ratio of movement/position, or may have an arbitrary relationship with regard to one or more dimensions or other parameters.
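The movement/position ratios mentioned above can be illustrated with a minimal sketch. The function name `avatar_displacement` and the convention that each ratio gives avatar units per physical unit along one axis are assumptions made for illustration only; the disclosure does not fix which side of a 4:1 ratio is scaled.

```python
def avatar_displacement(user_delta, ratio=(1.0, 1.0, 1.0)):
    """Map a physical displacement (x, y, z) to an avatar displacement.

    Assumption: each entry of `ratio` is avatar units per physical unit
    for that axis, so (1, 1, 1) is the 1:1 case and (4, 4, 4) is a scaled case.
    Unequal entries model an arbitrary per-dimension relationship.
    """
    return tuple(d * r for d, r in zip(user_delta, ratio))
```

Passing unequal per-axis ratios models the "arbitrary relationship with regard to one or more dimensions" case.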

In one example, AS 104 may receive a request to establish a visual communication session from device 131 and/or device 141. The visual communication session may be established for such devices after AS 104 retrieves one or more configuration settings for the user 191, user 192, and/or user 193, determines which configuration setting(s), if any, to apply based upon the context(s), and activates the respective action detection models and/or configuration setting(s) which are determined to apply to the context(s). The request may be received via access network 120, access network 122, network 102, and/or the Internet in general, and the visual communication session may be provided via any one or more of the same networks.

The establishment of the visual communication session may include providing security keys, tokens, certificates, or the like to encrypt and to protect the media streams between devices 131 and 141 when in transit via one or more networks and to allow devices 131 and 141 to decrypt and present received video content and/or received user interface content via user interfaces 135 and 145, respectively. In one example, the establishment of the visual communication session may further include reserving network resources of one or more networks (e.g., network 102, access networks 120 and 122, etc.) to support a particular quality of service (QoS) for the visual communication session (e.g., a certain video resolution, a certain delay measure, and/or a certain packet loss ratio, and so forth). Such reservation of resources may include an assignment of slots in priority queues of one or more routers, the use of a particular QoS flag in packet headers which may indicate that packets should be routed with a particular priority level, the establishment and/or use of a certain label-switched path with a guaranteed latency measure for packets of the visual communication session, and so forth.

In one example, AS 104 may establish a communication path such that media streams between device 131 and device 141 pass via AS 104, thereby allowing AS 104 to implement modifications to the visual content in accordance with the applicable configuration setting(s). The one or more configuration settings may be user-specified, may be based upon the capabilities of devices of user 191 and/or user 192 being used for the visual communication session, may be provided by an employer or sponsor of a visual communication session service of network 102 and/or AS 104, may be provided by an operator of network 102 or the system 100 in general, and so forth. As just one example, device 131 may provide information regarding the capabilities and capacities of device 131 and camera 132 to AS 104 in connection with a request to establish a visual communication session with device 141. AS 104 may send a notification of the request to device 141. Similarly, device 141 may provide information regarding the capabilities and capacities of device 141 and camera 142 to AS 104 in connection with a response to the request/notification to establish the visual communication session.

In one example, a visual communication session may be established between two or more users, and one or more additional users may request to join, and be joined to, the visual communication session in the same or a similar manner. Thus, as illustrated in FIG. 1, a visual communication session including users 191, 192, and 193 may further include a fourth user (“user 4”) represented as an avatar of a bird 181, and a fifth user (“user 5”) represented as a human-like avatar 183 in FIG. 1. The visual communication session may be a video call, a group video call, a short or long-lived AR or VR session, e.g., established privately among the users via AS 104 and/or the users' respective communication systems, or hosted by AS 104 for public or semi-public usage, e.g., a MMOG.

In one example, device 131 and/or device 141 may indicate a purpose for the visual communication session (e.g., further context) such as a work collaboration session, a client call, a personal call, etc. In this regard, the user 191 may have previously provided to AS 104 one or more configuration settings to match to different types of visual communication sessions (e.g., different contexts). In one example, AS 104 may determine that a configuration setting of user 191 is applicable in the context(s) of the current visual communication session. The context(s) may include the purpose of the visual communication session, the time of the visual communication session, the parties to the visual communication session, biometric data of one or more parties to the visual communication session, mood data regarding one or more parties to the visual communication session, and so forth.

In one example, the system 100 supports the creation of action detection models and associated one or more configuration settings. For example, the configuration settings may map actions and action detection models with applicable contexts to activate the action detection models and corresponding modifications to visual content to implement when respective actions are detected. The action detection models and the one or more configuration settings can be created by and/or for a single user for application to visual communication sessions of that user, can be created for a group of users, can be created by the system and made available for selection by users to activate (e.g., model profiles and/or default configuration settings), and so on.
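One way to picture the mapping described above is as a small record per configuration setting. The following is only a sketch under stated assumptions: the class name `ConfigurationSetting`, its field names, and the string labels for actions, models, contexts, and modifications are all hypothetical and are not defined by the present disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigurationSetting:
    """Hypothetical record tying an action to a model, contexts, and a modification."""
    action: str              # label of the action to detect
    model_id: str            # identifier of the action detection model to activate
    contexts: frozenset      # contexts in which the setting applies
    modification: str        # e.g., "block", "obfuscate", "replace"
    direction: str = "outbound"  # "outbound" or "inbound" visual content


def applicable_settings(settings, context):
    """Return the settings whose context set contains the current context."""
    return [s for s in settings if context in s.contexts]
```

Under this sketch, a system could call `applicable_settings` at session establishment and activate only the models named by the returned settings.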

To illustrate, in the example of FIG. 1, user 191 may have previously performed an action involving waving two sticks up and down simultaneously that the user 191 considered to be offensive. User 191 may also have determined that he or she would not like other participants to see this particular action in future visual communication sessions. As such, the user 191 may have provided an input (e.g., to AS 104) with regard to previous visual content earlier in the visual communication session, or in one or more earlier communication sessions, to indicate that the action, or this type of action, was offensive and should be edited in the future. AS 104 may create an action detection model for the action, e.g., extract features from the visual content which distinguish the action from “normal” visual content and/or content which does not include the action, and then activate the action detection model as a filter for future instances of the action. In addition, AS 104 may create a configuration setting for the user 191 to define a corresponding modification to visual content that should be made when additional occurrences of this particular action are detected via the action detection model/filter, e.g., block, obfuscate, replace, etc.

Returning to the illustration of FIG. 1, it can be seen that user 191 is engaging in the action of simultaneously waving two sticks A and B. This imagery of user 191 may be captured as visual content by camera 132 and forwarded to AS 104 via device 131. In the present example, AS 104 may apply the action detection model to the visual content, determine that it contains an instance of the action, and may then edit the action in the visual content in accordance with the one or more configuration settings of user 191. For instance, the user 191 may have indicated to AS 104 to replace instances of the action with non-movement. In this case, AS 104 may replace imagery of movement of stick A with a non-moving representation. In addition, AS 104 may then forward the visual content that has been modified/edited to other participants of the visual communication session. For instance, device 141 may receive the modified visual content from AS 104 and present the modified visual content via display screen 147. As illustrated in FIG. 1, in display screen 147 user 1 appears to be moving only stick B, while stick A is not moving. Thus, the presentation of the offensive gesture of user 191 has been prevented in accordance with the wishes of user 191 (e.g., as recorded in the one or more configuration settings of user 191).

Similarly, the one or more configuration settings of user 191 may further include an action detection model for inbound filtering of an action involving users simultaneously moving their arms in an in-and-out manner. Thus, for example, camera 142 may capture imagery of users 192 and 193 (e.g., visual content) which includes user 193 making such a motion. The camera may forward the visual content to AS 104 via device 141. AS 104 may then apply the action detection model and determine that the visual content includes the offensive action. AS 104 may then also modify the visual content in accordance with the configuration setting of user 191. For instance, user 191 may have indicated that such an action should result in blocking of the associated visual imagery of a user performing the offensive action. In this case, AS 104 may edit/modify the visual content to block imagery of user 193 performing the action and forward the modified visual content to device 131 for presentation to user 191. As illustrated in FIG. 1, device 131 may present the modified visual content via window 137, which includes imagery of user 192, but user 193 is omitted from the visual content using block 182. Notably, the same visual imagery may also be transmitted by AS 104 to communication systems of users 4 and 5, respectively. However, if these users are not offended by the action of simultaneously moving arms in an in-and-out manner and do not have action detection models activated to detect and filter such an action, then users 4 and 5 may see an unmodified version of the visual content from camera 142 (or versions that are at least not modified in accordance with the configuration settings for user 191).
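The per-recipient behavior described above, where user 191 receives a blocked version while users 4 and 5 receive unmodified content, can be sketched as follows. This is an illustrative assumption rather than the disclosed implementation: the function name `versions_for_recipients`, the user and action labels, and the dictionary layout are hypothetical, and detection itself is taken as already done.

```python
def versions_for_recipients(detected, recipients, inbound_settings):
    """Decide, per recipient, which depicted users must be modified and how.

    detected:         dict mapping a depicted user to the set of action labels
                      detected for that user in the visual content.
    inbound_settings: dict mapping a recipient to {action label: modification}.
    Returns a dict mapping each recipient to {depicted user: modification};
    an empty dict means that recipient gets the unmodified content.
    """
    plan = {}
    for recipient in recipients:
        rules = inbound_settings.get(recipient, {})
        edits = {}
        for user, actions in detected.items():
            for action in actions:
                if action in rules:
                    edits[user] = rules[action]
        plan[recipient] = edits
    return plan
```

Keeping the plan per recipient means the same source content can be forwarded in different versions without one user's settings affecting another user's view.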

It should be noted that in one example, the offensive action of user 193 may alternatively or additionally be detected via data from wearable devices 143 and 144. For example, readings from wearable devices 143 and 144 may indicate the motion of the arms of user 193. In addition, the action detection model may include features relating to wearable device/sensor measurements which can be compared to the readings from wearable devices 143 and 144 to determine a match to the action. Thus, in one example, AS 104 may obtain the measurements from devices 143 and 144, via device 141 and/or access network 122, and apply the action detection model to the measurements (e.g., as an alternative or in addition to the visual content from camera 142).
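As a rough illustration of comparing wearable readings against features stored with an action detection model, the sketch below uses cosine similarity between a window of sensor samples and a stored template. Cosine similarity, the 0.9 threshold, and the names `cosine_similarity` and `matches_action` are assumptions chosen for simplicity; the disclosure does not specify a particular comparison technique.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def matches_action(readings, template, threshold=0.9):
    """Return True when the readings resemble the stored template.

    `readings` and `template` are hypothetical single-axis accelerometer
    windows of the same length; differing lengths are treated as no match.
    """
    if len(readings) != len(template):
        return False
    return cosine_similarity(readings, template) >= threshold
```

A practical system would likely align windows in time and combine several sensors, which this sketch deliberately omits.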

The foregoing describes an example of network-based application of one or more configuration settings by AS 104. However, it should be understood that in other, further, and different examples, the application of one or more configuration settings and the modifications of visual content in accordance with the configuration settings may alternatively or additionally be applied locally, e.g., at device 131 and/or at device 141. It should also be noted that the foregoing describes examples of visual content filtering in accordance with users' configuration settings, e.g., applying action detection models, detecting occurrences of actions, modifying visual content, etc. However, in one example, additional filters and/or configuration settings may be applied for users' outbound and inbound visual content as directed by employers, head of household/account holders (e.g., for users who are children), and so forth. For instance, AS 104 may store a catalog of action detection models and/or configuration settings that may be selected for application to visual communication sessions of various users and for various contexts. For instance, AS 104 may have a plurality of available machine learning algorithms for detecting specific potentially offensive actions and/or a plurality of configuration settings associated with model profiles or default profiles (e.g., sensitive, somewhat sensitive, non-sensitive, etc., model profiles associated with particular cultures or situations, and so forth). In one example, a default profile may have a plurality of action detection models to be applied, while a user selecting the default profile may still specify the type of modifications to apply in response to detections of occurrences of the respective associated actions. Accordingly, users, employers, service providers, network operators, etc. may select various configuration settings from such a catalog to be applied by AS 104 and/or for download and application locally by the user devices and/or communication systems. In still another example, users or others with an interest and/or permission to apply configuration settings may also provide sample actions which may be captured via video and/or wearable device/sensor measurements from which an action detection model may be generated. Thus, certain actions may be preempted without first having to experience the action in an interactive communication session with other users. Thus, these and other modifications are all contemplated within the scope of the present disclosure.

It should also be noted that the system 100 has been simplified. Thus, it should be noted that the system 100 may be implemented in a different form than that which is illustrated in FIG. 1, or may be expanded by including additional endpoint devices, access networks, network elements, application servers, etc. without altering the scope of the present disclosure. In addition, system 100 may be altered to omit various elements, substitute elements for devices that perform the same or similar functions, combine elements that are illustrated as separate devices, and/or implement network elements as functions that are spread across several devices that operate collectively as the respective network elements. For example, the system 100 may include other network elements (not shown) such as border elements, routers, switches, policy servers, security devices, gateways, a content distribution network (CDN), and the like. For example, portions of network 102, access networks 120 and 122, and/or the Internet may comprise a content distribution network (CDN) having ingest servers, edge servers, and the like for packet-based streaming of video, audio, or other content. Similarly, although only two access networks, 120 and 122, are shown, in other examples, access networks 120 and/or 122 may each comprise a plurality of different access networks that may interface with network 102 independently or in a chained manner.

In one example, the system 100 may further include wireless or wired connections to sensors, such as temperature sensors, door sensors, light sensors, movement sensors, etc., to automated devices, such as aerial or vehicular drones (e.g., equipped with global positioning system (GPS) receivers, cameras, microphones, wireless transceivers, and so forth, and which may capture video content of a physical environment), to devices of other users and/or non-participants, and so forth. In another example, device 131 may maintain a first configuration setting when a visual communication session is established. However, a door sensor may communicate with device 131 to indicate that a door has been opened (e.g., to a house of user 191). This may indicate that other individuals may now be imminently present and that at least a second configuration setting should be applicable/activated, e.g., to apply more stringent filtering by activating more action detection models and configuration settings for removing/altering visual content, and so on. Thus, these and other modifications are all contemplated within the scope of the present disclosure.
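The door sensor example can be sketched as a small state holder that widens the set of active action detection models when the sensor reports an open door. The class name `SessionFilter`, the event vocabulary (`"door"`, `"open"`), and the model labels are hypothetical, introduced only to illustrate the idea of a second, more stringent configuration setting.

```python
class SessionFilter:
    """Tracks which action detection models are active for a session."""

    def __init__(self, base_models, strict_models):
        self.base_models = set(base_models)      # first configuration setting
        self.strict_models = set(strict_models)  # added by the second setting
        self.door_open = False

    def on_sensor_event(self, sensor, value):
        """Record a sensor report; only the hypothetical 'door' sensor matters here."""
        if sensor == "door":
            self.door_open = (value == "open")

    def active_models(self):
        """Return the models to apply given the latest sensor state."""
        if self.door_open:
            return self.base_models | self.strict_models
        return set(self.base_models)
```

In this sketch the stricter setting is dropped again when the door is reported closed; whether to revert automatically is a design choice the disclosure leaves open.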

FIG. 2 illustrates a flowchart of an example method 200 for detecting and modifying actions of visual representations of users in visual content, in accordance with the present disclosure. In one example, the method 200 is performed by a component of the system 100 of FIG. 1, such as by application server 104, device 131, or device 141, and/or any one or more components thereof (e.g., a processor, or processors, performing operations stored in and loaded from a memory), or by application server 104, in conjunction with one or more other devices, such as device 131, device 141, and so forth. In one example, the steps, functions, or operations of method 200 may be performed by a computing device or system 300, and/or processor 302 as described in connection with FIG. 3 below. For instance, the computing device or system 300 may represent any one or more components of application server 104, device 131, or device 141 in FIG. 1 that is/are configured to perform the steps, functions and/or operations of the method 200. Similarly, in one example, the steps, functions, or operations of method 200 may be performed by a processing system comprising one or more computing devices collectively configured to perform various steps, functions, and/or operations of the method 200. For instance, multiple instances of the computing device or processing system 300 may collectively function as a processing system. For illustrative purposes, the method 200 is described in greater detail below in connection with an example performed by a processing system. The method 200 begins in step 205 and proceeds to step 210.

At optional step 210, the processing system may receive a request to establish a communication session (e.g., a visual communication session) from at least one of a first communication system of a first user or a second communication system of a second user. The processing system may include at least one processor deployed in the first physical environment and/or at least one processor deployed in a communication network. The processing system may alternatively or additionally comprise the first communication system of the first user, the second communication system of the second user, and/or network-based components. The communication session may be for a video call, a group video call, an AR or VR session, a MMOG, or the like.

At step 220, the processing system establishes a communication session between at least a first communication system of a first user and a second communication system of a second user, the communication session including first visual content, the first visual content including a first visual representation of the first user. The first visual representation of the first user may comprise a video image of the first user or an animated avatar (human-like or non-human-like) associated with the first user. In one example, the first visual content is generated via the first communication system. In one example, the communication session includes second visual content, the second visual content including a second visual representation of the second user. For instance, the second visual content may be generated via the second communication system of the second user.

It should also be noted that although the terms “first,” “second,” “third,” etc., are used herein, the use of these terms is intended as labels only. Thus, the use of a term such as “third” in one example does not necessarily imply that the example must in every case include a “first” and/or a “second” of a similar item. In other words, the use of the terms “first,” “second,” “third,” and “fourth” does not imply a particular number of those items corresponding to those numerical values. In addition, the use of the term “third,” for example, does not imply a specific sequence or temporal relationship with respect to a “first” and/or a “second” of a particular type of item, unless otherwise indicated.

At step 230, the processing system detects a first action of the first visual representation of the first user in the first visual content in accordance with a first action detection model for detecting the first action. For example, the first action may comprise a gesture or other potentially offensive actions. In one example, the first action detection model comprises a machine learning model (MLM) for detecting the first action, wherein the MLM is trained based upon at least one input of the first user regarding at least one segment of visual content including at least one visual representation of at least one user. The MLM may identify, from the at least one segment, features of the at least one visual representation of the at least one user that distinguish the first action from visual content that does not contain the first action. In one example, the features are from a feature space comprising quantified aspects of the visual content. Quantified aspects may include low-level invariant image data, features relating to movement in a video, e.g., changes within images and between images, recognized objects (e.g., including parts of a human body such as legs, arms, hands, etc.), a length to width ratio of an object, a velocity of an object estimated from a sequence of images (e.g., video frames), and so forth. In one example, features may additionally be taken from wearable device inputs such as gyroscope and compass measurements from various points of a human body, eye movements, and so forth.
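Two of the quantified aspects named above, a length to width ratio of an object and a velocity estimated from a sequence of frames, can be computed as in the following sketch. The bounding-box format `(x, y, width, height)`, the use of object centers in pixel coordinates, and the function names are assumptions for illustration; object recognition is assumed to have already produced the boxes and centers.

```python
import math


def length_to_width_ratio(box):
    """Ratio of height (length) to width for a box given as (x, y, width, height)."""
    _, _, width, height = box
    if width == 0:
        raise ValueError("box width must be non-zero")
    return height / width


def estimated_velocity(centers, fps):
    """Average speed, in pixels per second, of an object tracked across frames.

    `centers` holds one (x, y) center per consecutive video frame, and `fps`
    is the frame rate. Fewer than two centers yields a speed of 0.0.
    """
    if len(centers) < 2:
        return 0.0
    path_length = sum(
        math.hypot(x2 - x1, y2 - y1)
        for (x1, y1), (x2, y2) in zip(centers, centers[1:])
    )
    return path_length * fps / (len(centers) - 1)
```

Values like these would form only part of a feature vector; the disclosure also lists low-level image data, recognized objects, and wearable measurements.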

The first action detection model/MLM can be trained from input of other users regarding actions by various other users. In one example, the first user may borrow one of several standard profiles which may include the first action detection model. In one example, the first action detection model may be activated by the processing system for detection and remediation of an action when more than a threshold number of users identify the same or similar action as being offensive. In one example, the number of users may be users who are also utilizing the same standard profile or profile level, users who self-identify as being a same type of user, users who are participating in a same MMOG, and so forth. For example, a video call can be established between users from different countries with different customs. Thus, standard profiles pertaining to offensive actions from country 1 can be used to filter visual content containing actions of users of country 2 and vice versa, where standard profiles pertaining to offensive actions from country 2 can be used to filter visual content containing actions of users of country 1.
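The threshold rule described above can be sketched as a count of distinct reporting users per action. The function name `actions_to_activate` and the `(user_id, action)` report format are hypothetical; grouping "same or similar" actions under one label is assumed to happen upstream.

```python
def actions_to_activate(reports, threshold):
    """Return actions reported as offensive by more than `threshold` distinct users.

    `reports` is an iterable of (user_id, action_label) pairs. Repeated
    reports from the same user for the same action count once.
    """
    reporters = {}
    for user_id, action in reports:
        reporters.setdefault(action, set()).add(user_id)
    return {action for action, users in reporters.items() if len(users) > threshold}
```

Restricting `reports` to users who share a standard profile, user type, or MMOG would implement the narrower populations mentioned above.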

In one example, the processing system may select the first action detection model for active use when one or more context criteria are met. For instance, the processing system may activate the action detection model when the context includes one or more of: a physical location of the first user, a physical location of the second user, a time of day, a presence of other individuals besides the first user and the second user in the communication session, a relationship between the first user and the second user, a type of task for the communication session, a topic of the communication session, and so forth. In one example, the context may be that the first user has provided an input to the processing system indicating that an offensive action was encountered. The processing system may then generate the action detection model/MLM and activate the action detection model when the generating is completed. In one example, the processing system may continue to refine the action detection model/MLM with each occurrence of the action that is detected (such as in accordance with step 240 below). In addition, the processing system may continue to receive input from the first user and/or other users regarding whether detection of an action and a corresponding modification of visual content was appropriately applied. In other words, the processing system utilizes reinforcement learning to utilize new positive examples and/or new negative examples to enhance the action detection model and its classification capability.

At step 240, the processing system modifies, in response to the detecting the first action, the first visual content in accordance with first configuration settings of the first user for the communication session. In particular, the modifying may comprise modifying the first action of the first visual representation of the first user in the first visual content based upon the first configuration settings. For example, the first configuration settings may specify a first modification to be applied to the first action of the first visual representation of the first user in the first visual content (and to other occurrences of the same action/type of action). The first modification may comprise at least one of: blocking at least a portion of the first visual representation of the first user in the first visual content, obfuscating at least a portion of the first visual representation of the first user in the first visual content, removing at least a portion of the first visual representation of the first user in the first visual content, or changing the first action of the first visual representation of the first user in the first visual content to a different action. The modification may be selected by a user, or may be defined in connection with a default or standard profile that may be selected by or for a user, and which may include the particular configuration setting and the associated action detection model.
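Three of the modification types listed above (blocking, obfuscating, and removing a portion of a visual representation) can be illustrated on a toy grayscale frame. This is a sketch under simplifying assumptions: a frame is a list of rows of integer pixel values, the affected region is a rectangle `(top, left, bottom, right)` with exclusive lower bounds, obfuscation is approximated by filling with the region mean, and removal copies pixels from a separately supplied background frame. The function name `apply_modification` is hypothetical, and changing an action to a different action is not covered.

```python
def apply_modification(frame, region, modification, background=None):
    """Return a modified copy of `frame`; the input frame is left untouched."""
    top, left, bottom, right = region
    cells = [(r, c) for r in range(top, bottom) for c in range(left, right)]
    out = [row[:] for row in frame]
    if modification == "block":
        # Cover the region with a solid value.
        for r, c in cells:
            out[r][c] = 0
    elif modification == "obfuscate":
        # Replace every pixel in the region with the region's mean value.
        if cells:
            mean = sum(frame[r][c] for r, c in cells) // len(cells)
            for r, c in cells:
                out[r][c] = mean
    elif modification == "remove":
        # Show the background where the representation used to be.
        if background is None:
            raise ValueError("remove requires a background frame")
        for r, c in cells:
            out[r][c] = background[r][c]
    else:
        raise ValueError(f"unsupported modification: {modification}")
    return out
```

The modification string would come from the applicable configuration setting, so the same detected action can be handled differently for different users or contexts.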

The first configuration settings (and similarly the first action detection model) may be associated with various contexts. For instance, the first configuration settings may be associated with at least one of: a physical location of the first user, a physical location of the second user, a time of day, a presence of other individuals besides the first user and the second user in the communication session, a relationship between the first user and the second user, a type of task for the communication session, a topic of the communication session, and so forth. For example, if the communication session is indicated to be for sports talk between fans of two teams, the configuration settings may include heightened filtering (e.g., application of additional action detection models and/or more stringent modifications of the visual content) since the first user may be more likely to engage in bad actions that he or she may regret. Alternatively, the configuration settings may be more permissive in terms of filtering (e.g., lowered standards) since bad gestures may be expected and intended, versus speaking with colleagues or customers relating to work, for example. In one example, different modifications to the first visual content may be indicated for the same action, but for different contexts.

In one example, the first configuration settings may also be selected for application based upon a capability of the first communication system, a capability of the second communication system, a capability of the processing system, or a capability of a network supporting the communication session. For instance, if the processing system is not capable of modifying the first visual representation of the first user in real-time (e.g., without perceptible delay, jumps in the visual content, visible artifacts, etc.), the processing system may select to block the first visual representation, which may be simpler and require less time and computing resources than the preferred option of modifying the first visual representation of the first user to show a static image rather than the movement of the offending action.
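A capability-based fallback of this kind may be sketched, under stated assumptions, as a choice among an ordered list of preferred modifications given an estimated per-frame processing cost and a per-frame time budget. The function name `choose_by_capability`, the cost figures, and the budget value are hypothetical and serve only to illustrate the selection logic.

```python
from typing import Dict, List


def choose_by_capability(
    preferences: List[str],
    cost_ms: Dict[str, float],
    frame_budget_ms: float,
) -> str:
    """Return the first preferred modification that fits the frame budget.

    If none fits, fall back to the cheapest option so that some
    modification is still applied (an assumed policy for this sketch).
    """
    if not preferences:
        raise ValueError("at least one modification must be listed")
    for mod in preferences:
        if cost_ms[mod] <= frame_budget_ms:
            return mod
    return min(preferences, key=lambda m: cost_ms[m])


# Hypothetical costs: substituting a static image is assumed to be far more
# expensive than simply blocking the visual representation.
preferences = ["static_image", "block"]
cost_ms = {"static_image": 45.0, "block": 2.0}
```

With a budget of roughly 33 ms per frame (about 30 frames per second), the sketch falls back to blocking; with a 50 ms budget it keeps the preferred static-image modification.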

At step 250, the processing system transmits the first visual content that is modified to the second communication system of the second user. In one example, the second communication system is to display the first visual content that is modified for the second user.

At optional step 260, the processing system may detect a second action of the second visual representation of the second user in the second visual content in accordance with a second action detection model for detecting the second action. In one example, the second action detection model comprises a second machine learning model for detecting the second action. In one example, the second action detection model may be made active in accordance with the first configuration settings of the first user. For instance, the first user may have both inbound and outbound filtering of visual content for the communication session. In one example, the second machine learning model is trained based upon at least one input of a user regarding at least one segment of visual content including at least one visual representation of at least one user. For instance, the processing system may receive inputs by other users who are similar to the first user (e.g., have the same or similar profiles) and may determine that the second action may be an offensive action to the group of similar users. Thus, the second action detection model may be activated for the first user.

At optional step 270, the processing system may modify, in response to the detecting the second action, the second visual content in accordance with the first configuration settings of the first user for the communication session. For instance, optional step 270 may comprise similar operations as described above in connection with step 240.

At optional step 280, the processing system may transmit the second visual content that is modified to the first communication system of the first user. For instance, optional step 280 may be performed in an example where the processing system comprises a network-based processing system.

At optional step 290, the processing system may present the second visual content that is modified. For instance, optional step 290 may be performed when the processing system is the first communication system or includes the first communication system (e.g., further comprising network-based components and/or the second communication system).

Following step 250, or any of the optional steps 260-290, the method 200 proceeds to step 295 where the method ends.

It should be noted that the method 200 may be expanded to include additional steps, or may be modified to replace steps with different steps, to combine steps, to omit steps, to perform steps in a different order, and so forth. For instance, in one example the processor may repeat one or more steps of the method 200, such as steps 220-250 to continue to receive first visual content, to detect the first action, to modify the first visual content, etc. The processor may similarly repeat steps 260-280 and/or 260-290 to continue to receive second visual content, to detect the second action, to modify the second visual content, etc.
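The repetition of the receiving, detecting, and modifying steps may be pictured, purely as an illustrative sketch, as a loop over incoming frames in which each of a plurality of detectors examines a short window of recent frames. The `filter_stream` generator, the window length, and the detector and modifier callables are hypothetical stand-ins for the action detection models and configured modifications; they are not drawn from the disclosure.

```python
from collections import deque
from typing import Any, Callable, Deque, Iterable, Iterator, List

# Hypothetical signatures: a detector sees the recent frames (oldest first)
# and reports whether its action is present; a modifier rewrites one frame.
Detector = Callable[[List[Any]], bool]
Modifier = Callable[[Any], Any]


def filter_stream(
    frames: Iterable[Any],
    detectors: List[Detector],
    modifier: Modifier,
    window: int = 3,
) -> Iterator[Any]:
    """Yield each frame, modified whenever any detector fires on the window."""
    recent: Deque[Any] = deque(maxlen=window)
    for frame in frames:
        recent.append(frame)
        if any(detect(list(recent)) for detect in detectors):
            yield modifier(frame)
        else:
            yield frame


# Toy stand-ins: each "frame" is a single number, and a jump of more than 5
# between consecutive frames plays the role of an offending movement.
def sudden_jump(window_frames: List[Any]) -> bool:
    return len(window_frames) >= 2 and window_frames[-1] - window_frames[-2] > 5


def mask(frame: Any) -> Any:
    return -1


cleaned = list(filter_stream([0, 1, 10, 11], [sudden_jump], mask))
```

A window of frames is used in this sketch because an action may be a movement that spans a plurality of frames rather than something visible in a single frame.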

In still another example, the method 200 may be expanded to include topic (e.g., theme and/or concept) detection and then selecting configuration settings for the first user and/or the second user in accordance with the topic. For instance, the processing system may apply topic models (e.g., classifiers) for a number of topics to the first visual content and/or the second visual content to identify a topic. The topic model classifiers can be trained from any text, video, image, audio and/or other types of content to recognize various topics, which may include objects like “car,” scenes like “outdoor,” and actions or events like “baseball.” Topic identification classifiers may include support vector machine (SVM) based or non-SVM based classifiers, such as neural network based classifiers, and may utilize the same or similar features extracted from the first visual content or the second visual content that may be used to identify objects for modification in accordance with first configuration settings and/or second configuration settings. Once a topic is identified, the topic may be further correlated with configuration settings (e.g., including action detection models) for work collaboration, client meeting, family, personal call, etc. For instance, a topic of “baseball” may be mapped to configuration settings for “personal call” rather than “work collaboration.” The mapping(s) may be provided by the users, a head of household, an employer, a provider of a visual communication session service, and so forth. Thus, these and other modifications are all contemplated within the scope of the present disclosure.
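The correlation of an identified topic with configuration settings may be illustrated, as a minimal sketch only, by taking the highest-scoring topic from a set of classifier outputs and looking it up in a provided mapping. The `profile_for_topic` function, the score threshold, and the example scores are assumptions made for this illustration; the classifiers themselves are not modeled.

```python
from typing import Dict


def profile_for_topic(
    topic_scores: Dict[str, float],
    topic_to_profile: Dict[str, str],
    default_profile: str,
    min_score: float = 0.5,
) -> str:
    """Map the most likely topic to a configuration profile.

    Falls back to `default_profile` when no topic is confident enough
    or when the top topic has no mapping (assumed behavior).
    """
    if not topic_scores:
        return default_profile
    topic, score = max(topic_scores.items(), key=lambda kv: kv[1])
    if score < min_score:
        return default_profile
    return topic_to_profile.get(topic, default_profile)


# Hypothetical mapping, e.g., as provided by a user or an employer.
topic_to_profile = {"baseball": "personal call", "budget": "work collaboration"}
scores = {"baseball": 0.9, "car": 0.2}
```

In this example, `profile_for_topic(scores, topic_to_profile, "work collaboration")` selects the "personal call" settings because "baseball" is the dominant topic.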

In addition, although not expressly specified above, one or more steps of the method 200 may include a storing, displaying and/or outputting step as required for a particular application. In other words, any data, records, fields, and/or intermediate results discussed in the method can be stored, displayed and/or outputted to another device as required for a particular application. Furthermore, operations, steps, or blocks in FIG. 2 that recite a determining operation or involve a decision do not necessarily require that both branches of the determining operation be practiced. In other words, one of the branches of the determining operation can be deemed as an optional step. Furthermore, operations, steps or blocks of the above described method(s) can be combined, separated, and/or performed in a different order from that described above, without departing from the example embodiments of the present disclosure.

FIG. 3 depicts a high-level block diagram of a computing device or processing system specifically programmed to perform the functions described herein. For example, any one or more components or devices illustrated in FIG. 1 or described in connection with the method 200 may be implemented as the processing system 300. As depicted in FIG. 3, the processing system 300 comprises one or more hardware processor elements 302 (e.g., a microprocessor, a central processing unit (CPU) and the like), a memory 304 (e.g., random access memory (RAM), read only memory (ROM), a disk drive, an optical drive, a magnetic drive, and/or a Universal Serial Bus (USB) drive), a module 305 for detecting and modifying actions of visual representations of users in visual content, and various input/output devices 306, e.g., a camera, a video camera, storage devices, including but not limited to, a tape drive, a floppy drive, a hard disk drive or a compact disk drive, a receiver, a transmitter, a speaker, a display, a speech synthesizer, an output port, and a user input device (such as a keyboard, a keypad, a mouse, and the like).

Although only one processor element is shown, it should be noted that the computing device may employ a plurality of processor elements. Furthermore, although only one computing device is shown in the Figure, if the method(s) as discussed above is implemented in a distributed or parallel manner for a particular illustrative example, i.e., the steps of the above method(s) or the entire method(s) are implemented across multiple or parallel computing devices, e.g., a processing system, then the computing device of this Figure is intended to represent each of those multiple general-purpose computers. Furthermore, one or more hardware processors can be utilized in supporting a virtualized or shared computing environment. The virtualized computing environment may support one or more virtual machines representing computers, servers, or other computing devices. In such virtualized virtual machines, hardware components such as hardware processors and computer-readable storage devices may be virtualized or logically represented. The hardware processor 302 can also be configured or programmed to cause other devices to perform one or more operations as discussed above. In other words, the hardware processor 302 may serve the function of a central controller directing other devices to perform the one or more operations as discussed above.

It should be noted that the present disclosure can be implemented in software and/or in a combination of software and hardware, e.g., using application specific integrated circuits (ASIC), a programmable logic array (PLA), including a field-programmable gate array (FPGA), or a state machine deployed on a hardware device, a computing device, or any other hardware equivalents, e.g., computer readable instructions pertaining to the method(s) discussed above can be used to configure a hardware processor to perform the steps, functions and/or operations of the above disclosed method(s). In one example, instructions and data for the present module or process 305 for detecting and modifying actions of visual representations of users in visual content (e.g., a software program comprising computer-executable instructions) can be loaded into memory 304 and executed by hardware processor element 302 to implement the steps, functions or operations as discussed above in connection with the example method 200. Furthermore, when a hardware processor executes instructions to perform “operations,” this could include the hardware processor performing the operations directly and/or facilitating, directing, or cooperating with another hardware device or component (e.g., a co-processor and the like) to perform the operations.

The processor executing the computer readable or software instructions relating to the above described method(s) can be perceived as a programmed processor or a specialized processor. As such, the present module 305 for detecting and modifying actions of visual representations of users in visual content (including associated data structures) of the present disclosure can be stored on a tangible or physical (broadly non-transitory) computer-readable storage device or medium, e.g., volatile memory, non-volatile memory, ROM memory, RAM memory, magnetic or optical drive, device or diskette and the like. Furthermore, a “tangible” computer-readable storage device or medium comprises a physical device, a hardware device, or a device that is discernible by the touch. More specifically, the computer-readable storage device may comprise any physical devices that provide the ability to store information such as data and/or instructions to be accessed by a processor or a computing device such as a computer or an application server.

While various embodiments have been described above, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of a preferred embodiment should not be limited by any of the above-described example embodiments, but should be defined only in accordance with the following claims and their equivalents.

What is claimed is:
 1. A method comprising: establishing, by a processing system including at least one processor, a communication session between at least a first communication system of a first user and a second communication system of a second user, the communication session including first visual content, the first visual content including a first visual representation of the first user; detecting, by the processing system, a first action of the first visual representation of the first user in the first visual content in accordance with a first action detection model for detecting the first action, wherein the first action comprises a movement of the first visual representation of the first user in a plurality of frames of the first visual content, wherein the first action detection model is one of a plurality of action detection models applied by the processing system to detect a plurality of different actions including the first action; modifying, by the processing system in response to the detecting the first action, the first visual content in accordance with a first configuration setting, wherein the first configuration setting is associated with at least one of: a capability of the first communication system, a capability of the second communication system, a capability of the processing system, or a capability of a network supporting the communication session; and transmitting, by the processing system, the first visual content that is modified to the second communication system of the second user.
 2. The method of claim 1, wherein the first visual representation of the first user comprises: a video image of the first user; or an animated avatar associated with the first user.
 3. The method of claim 1, wherein the first visual content is generated via the first communication system.
 4. The method of claim 1, wherein the second communication system is to display the first visual content that is modified for the second user.
 5. The method of claim 1, wherein the first configuration setting specifies a first modification to apply to the first action of the first visual representation of the first user in the first visual content.
 6. The method of claim 5, wherein the first modification comprises at least one of: blocking at least a portion of the first visual representation of the first user in the first visual content; obfuscating at least a portion of the first visual representation of the first user in the first visual content; removing at least a portion of the first visual representation of the first user in the first visual content; or changing the first action of the first visual representation of the first user in the first visual content to a different action.
 7. The method of claim 1, wherein the first action detection model comprises a machine learning model for detecting the first action, wherein the machine learning model is trained based upon at least one input of the first user regarding at least one segment of visual content including at least one visual representation of at least one user.
 8. The method of claim 7, wherein the machine learning model identifies, from the at least one segment, features of the at least one visual representation of the at least one user that distinguish the first action from visual content that does not contain the first action.
 9. The method of claim 8, wherein the features are from a feature space comprising quantified aspects of the visual content.
 10. The method of claim 1, wherein the first action detection model comprises a machine learning model for detecting the first action, wherein the machine learning model is trained based upon at least one user input regarding at least one segment of visual content including at least one visual representation of at least one user.
 11. The method of claim 1, wherein the communication session includes second visual content, the second visual content including a second visual representation of the second user.
 12. The method of claim 11, further comprising: detecting a second action of the second visual representation of the second user in the second visual content in accordance with a second action detection model for detecting the second action, wherein the plurality of action detection models includes the second action detection model; and modifying, in response to the detecting the second action, the second visual content in accordance with a second configuration setting of the first user for the communication session.
 13. The method of claim 12, further comprising: transmitting the second visual content that is modified to the first communication system of the first user.
 14. The method of claim 12, further comprising: presenting the second visual content that is modified.
 15. The method of claim 12, wherein the second action detection model comprises a second machine learning model for detecting the second action, wherein the second machine learning model is trained based upon at least one user input regarding at least one segment of visual content including at least one visual representation of at least one user.
 16. The method of claim 1, wherein the first configuration setting is further associated with at least one of: a time of day; a type of task for the communication session; or a topic of the communication session.
 17. The method of claim 1, wherein the first configuration setting is further associated with at least one of: a physical location of the first user; a physical location of the second user; a presence of at least one other individual besides the first user and the second user in the communication session; or a relationship between the first user and the second user.
 18. The method of claim 1, further comprising: receiving a request to establish the communication session from at least one of: the first communication system or the second communication system.
 19. A non-transitory computer-readable medium storing instructions which, when executed by a processing system including at least one processor, cause the processing system to perform operations, the operations comprising: establishing a communication session between at least a first communication system of a first user and a second communication system of a second user, the communication session including first visual content, the first visual content including a first visual representation of the first user; detecting a first action of the first visual representation of the first user in the first visual content in accordance with a first action detection model for detecting the first action, wherein the first action comprises a movement of the first visual representation of the first user in a plurality of frames of the first visual content, wherein the first action detection model is one of a plurality of action detection models applied by the processing system to detect a plurality of different actions including the first action; modifying, in response to the detecting the first action, the first visual content in accordance with a first configuration setting, wherein the first configuration setting is associated with at least one of: a capability of the first communication system, a capability of the second communication system, a capability of the processing system, or a capability of a network supporting the communication session; and transmitting the first visual content that is modified to the second communication system of the second user.
 20. A device comprising: a processing system including at least one processor; and a computer-readable medium storing instructions which, when executed by the processing system, cause the processing system to perform operations, the operations comprising: establishing a communication session between at least a first communication system of a first user and a second communication system of a second user, the communication session including first visual content, the first visual content including a first visual representation of the first user; detecting a first action of the first visual representation of the first user in the first visual content in accordance with a first action detection model for detecting the first action, wherein the first action comprises a movement of the first visual representation of the first user in a plurality of frames of the first visual content, wherein the first action detection model is one of a plurality of action detection models applied by the processing system to detect a plurality of different actions including the first action; modifying, in response to the detecting the first action, the first visual content in accordance with a first configuration setting, wherein the first configuration setting is associated with at least one of: a capability of the first communication system, a capability of the second communication system, a capability of the processing system, or a capability of a network supporting the communication session; and transmitting the first visual content that is modified to the second communication system of the second user. 